Data scientists are often faced with data sets that contain unstructured text, such as product review data, and must employ natural language processing (NLP) techniques to make it useful. Sentiment analysis refers to the use of NLP techniques to extract subjective information such as the polarity of a text, e.g., whether the author is speaking positively or negatively about some topic.
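As a warm-up, polarity can be illustrated with a toy lexicon-based scorer. This is only a sketch with hypothetical word lists, not the trained model GraphLab Create applies later:

```python
# Minimal sketch of lexicon-based polarity scoring. The word lists are
# hypothetical; real sentiment models are trained on labeled data.
POSITIVE = {'great', 'love', 'excellent', 'reliable'}
NEGATIVE = {'poor', 'hate', 'broken', 'useless'}

def polarity(text):
    """Return a score in [-1, 1]: fraction of positive minus negative words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else float(pos - neg) / total

print(polarity('I love this monitor, the range is excellent'))  # 1.0
print(polarity('Poor signal and a broken clip'))                # -1.0
```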
Companies often have useful data hidden in large volumes of text, such as product reviews.
For example, when shopping it can be challenging to decide between products with the same star rating. When this happens, shoppers often sift through the raw text of reviews to understand the strengths and weaknesses of each option.
In this short note, we will show how to use GraphLab Create's sentiment_analysis toolkit to apply pre-trained models that predict sentiment for text data in these situations. More specifically, we will automate the task of determining product strengths and weaknesses from review text.
Finally, we will also show how to use GraphLab Create's product_sentiment toolkit to summarize sentiment about products within reviews written by happy and/or unhappy customers. The products and aspects of interest will be the same as before, to ease comparisons and conclusions.
Important Note:
GraphLab Create includes feature engineering objects that leverage spaCy, a high-performance NLP package. Here we use it to extract parts of speech and to parse reviews into sentences.
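The part-of-speech step can be illustrated with a toy lookup-based tagger. spaCy's real tagger is statistical; the tag assignments below are hypothetical and cover only this example:

```python
# Toy part-of-speech lookup standing in for a statistical tagger such as
# spaCy's. The tag assignments are hypothetical and cover only this example.
POS_TAGS = {'great': 'ADJ', 'clear': 'ADJ', 'short': 'ADJ',
            'audio': 'NOUN', 'battery': 'NOUN', 'is': 'VERB'}

def extract_adjectives(sentence):
    """Return the words tagged ADJ in the toy lexicon."""
    return [w for w in sentence.lower().split() if POS_TAGS.get(w) == 'ADJ']

print(extract_adjectives('Audio is clear'))  # ['clear']
```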
In [1]:
import graphlab as gl
In [2]:
def nlp_pipeline(reviews, title, aspects):
    from graphlab.toolkits.text_analytics import (
        trim_rare_words, split_by_sentence, extract_parts_of_speech,
        stopwords, PartOfSpeech)
    print(title)
    print('1. Get reviews for this product')
    reviews = reviews.filter_by(title, 'name')
    print('2. Splitting reviews into sentences')
    reviews['sentences'] = split_by_sentence(reviews['review'])
    sentences = reviews.stack('sentences', 'sentence').dropna()
    print('3. Tagging relevant reviews')
    tags = gl.SFrame({'tag': aspects})
    tagger_model = gl.data_matching.autotagger.create(tags, verbose=False)
    tagged = tagger_model.tag(sentences, query_name='sentence',
                              similarity_threshold=.3, verbose=False)\
                         .join(sentences, on='sentence')
    print('4. Extracting adjectives')
    tagged['cleaned'] = trim_rare_words(tagged['sentence'],
                                        stopwords=list(stopwords()))
    tagged['adjectives'] = extract_parts_of_speech(tagged['cleaned'],
                                                   [PartOfSpeech.ADJ])
    print('5. Predicting sentence-level sentiment')
    # Predict on the sentence text, so each sentence gets its own score.
    model = gl.sentiment_analysis.create(tagged, target=None,
                                         features=['sentence'])
    tagged['sentiment'] = model.predict(tagged)
    return tagged
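Step 2 above relies on SFrame.stack: each review row carries a list of sentences, and stacking expands that list into one row per sentence. A pure-Python sketch of that unpacking, with hypothetical review data:

```python
# Pure-Python sketch of the SFrame.stack operation used in the pipeline:
# expand a list-valued column into one row per list element.
# The review data below is hypothetical.
reviews = [
    {'name': 'Snuza Baby Monitor, Hero',
     'sentences': ['Great range.', 'Battery life is short.']},
    {'name': 'Snuza Baby Monitor, Hero',
     'sentences': ['Audio is clear.']},
]

def stack(rows, list_col, new_col):
    """Emit one output row per element of rows[i][list_col]."""
    out = []
    for row in rows:
        for item in row[list_col]:
            new_row = {k: v for k, v in row.items() if k != list_col}
            new_row[new_col] = item
            out.append(new_row)
    return out

sentences = stack(reviews, 'sentences', 'sentence')
print(len(sentences))  # 3 rows, one per sentence
```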
In [3]:
reviews = gl.SFrame('./amazon_baby.gl/')
In [4]:
reviews
Out[4]:
In [5]:
from helper_util import *
Next, we collect the baby monitor reviews:
In [6]:
reviews = search(reviews, 'monitor')
In [7]:
reviews.print_rows(num_rows=100,max_row_width=300)
In [8]:
for review in reviews['review'][0:10]:
    print review, '\n'
First, we define the aspects (product properties) of interest:
In [9]:
aspects = ['audio', 'price', 'signal', 'range', 'battery life']
In [10]:
item_a = 'Snuza Baby Monitor, Hero'
reviews_a = nlp_pipeline(reviews, item_a, aspects)
In [11]:
reviews_a
Out[11]:
In [12]:
reviews_a.save('./reviews_a')
In [13]:
dropdown = get_dropdown(reviews)
display(dropdown)
In [14]:
reviews_a = gl.load_sframe('./reviews_a/')
item_b = dropdown.value
print 'Comparing reviews with \'%s\':\n' % item_b
reviews_b = nlp_pipeline(reviews, item_b, aspects)
In [15]:
counts, sentiment, adjectives = get_comparisons(reviews_a, reviews_b, item_a, item_b, aspects)
Comparing the number of sentences that mention each aspect:
In [16]:
counts
Out[16]:
Comparing the sentence-level sentiment for each aspect of each product:
In [17]:
sentiment
Out[17]:
Comparing the use of adjectives for each aspect:
In [18]:
adjectives
Out[18]:
In [19]:
good, bad = get_extreme_sentences(reviews_a)
Print the good sentences for the first item, with adjectives and aspects highlighted.
In [20]:
print_sentences(good['highlighted'])
Print the bad sentences for the first item, with adjectives and aspects highlighted.
In [21]:
print_sentences(bad['highlighted'])
product_sentiment.create()
One can even summarize the sentiment of the baby monitor reviews (reviews) by using GraphLab Create's product_sentiment toolkit. The toolkit lets us search for aspects of interest (e.g., product properties) and obtain summaries of the reviews or sentences with the most positive (or negative) predicted sentiment.
Note that since no target variable is given, a pre-trained model will be used.
In [25]:
reviews_sentiment = gl.product_sentiment.create(reviews,
                                                target=None,
                                                features=['review'],
                                                method='auto',
                                                splitby='review')
To get an overview of the top-10 reviews for every product property in the aspects list below:
In [26]:
print aspects
we make the call:
In [27]:
reviews_sentiment.sentiment_summary(keywords=aspects, groupby='name', k=10, threshold=3)
Out[27]:
Note that we have grouped the results by product 'name', and we have limited the summaries to products having at least three (threshold=3) reviews.
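The groupby='name' and threshold=3 arguments correspond to grouping scores per product and dropping products with too few reviews. A rough pure-Python analogue, using hypothetical sentiment scores rather than the toolkit's internals:

```python
from collections import defaultdict

# Hypothetical (product, sentiment) pairs standing in for per-review
# predictions from the sentiment model.
scored = [('monitor A', 0.9), ('monitor A', 0.8), ('monitor A', 0.4),
          ('monitor B', 0.2)]

def sentiment_summary(rows, threshold):
    """Mean sentiment per product, keeping products with >= threshold reviews."""
    groups = defaultdict(list)
    for name, score in rows:
        groups[name].append(score)
    return {name: sum(scores) / len(scores)
            for name, scores in groups.items() if len(scores) >= threshold}

print(sentiment_summary(scored, threshold=3))  # only 'monitor A' survives
```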
While creating the model, several operations are completed under the hood; for example, reviews are split into sentences with NLTK's punkt sentence parser. Providing a splitby='sentence' argument when training the model means that all analysis is performed at the sentence level rather than on the entire text. Thus any calls to sentiment_summary() will return predictions for each sentence rather than for the entire review.
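The effect of splitting can be illustrated with a naive regex splitter. punkt itself is a trained model that handles abbreviations and edge cases; this stand-in is only for illustration:

```python
import re

def naive_split_sentences(text):
    """Split on ., !, or ? followed by whitespace -- a crude stand-in
    for a trained sentence tokenizer such as punkt."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

review = 'Great monitor! The range is huge. Battery life could be better.'
for s in naive_split_sentences(review):
    print(s)
```

With splitby='sentence', each of these three fragments would receive its own sentiment prediction instead of one score for the whole review.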
In [28]:
reviews_sentiment1 = gl.product_sentiment.create(reviews,
                                                 target=None,
                                                 features=['review'],
                                                 method='auto',
                                                 splitby='sentence')
In [29]:
reviews_sentiment1.sentiment_summary(keywords=aspects, groupby='name', k=10, threshold=3)
Out[29]:
To inspect the internal sentiment model used:
In [37]:
reviews_sentiment1.sentiment_scorer
Out[37]:
To inspect the model that searches for text snippets:
In [38]:
reviews_sentiment1.review_searcher
Out[38]: